Smart Paper Notes

Background: Encouraged by the success of pretrained Transformer models in many natural language processing tasks, their use for International Classification of Diseases (ICD) coding tasks is now actively being explored. In this study, we investigate three types of Transformer-based models, aiming to address the extreme label set and long text classification challenges that are posed by automated ICD coding tasks. Methods: The Transformer-based model PLM-ICD achieved the current state-of-the-art (SOTA) performance on the ICD coding benchmark dataset MIMIC-III. It was chosen as our baseline model to be further optimised. XR-Transformer, the new SOTA model in the general extreme multi-label text classification domain, and XR-LAT, a novel adaptation of the XR-Transformer model, were also trained on the MIMIC-III dataset. XR-LAT is a recursively trained model chain on a predefined hierarchical code tree with label-wise attention, knowledge transferring and dynamic negative sampling mechanisms. Results: Our optimised PLM-ICD model, which was trained with longer total and chunk sequence lengths, significantly outperformed the current SOTA PLM-ICD model, and achieved the highest micro-F1 score of 60.8%. The XR-Transformer model, although SOTA in the general domain, did not perform well across all metrics. The best XR-LAT based model obtained results that were competitive with the current SOTA PLM-ICD model, including improving the macro-AUC by 2.1%. Conclusion: Our optimised PLM-ICD model is the new SOTA model for automated ICD coding on the MIMIC-III dataset, while our novel XR-LAT model performs competitively with the previous SOTA PLM-ICD model.
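The abstract attributes the improvement of the optimised PLM-ICD model to longer total and chunk sequence lengths. The sketch below illustrates only the general idea of truncating a long document to a total token budget and cutting it into fixed-size chunks. The function name, parameter names, and example values are hypothetical and are not taken from the PLM-ICD code.

```python
from typing import List


def split_into_chunks(token_ids: List[int], chunk_len: int, max_total_len: int) -> List[List[int]]:
    """Truncate to max_total_len tokens, then cut into consecutive chunks.

    Every chunk has at most chunk_len tokens; the last one may be shorter.
    Both lengths are illustrative assumptions, not values from the paper.
    """
    if chunk_len <= 0 or max_total_len <= 0:
        raise ValueError("chunk_len and max_total_len must be positive")
    truncated = token_ids[:max_total_len]
    return [truncated[i:i + chunk_len] for i in range(0, len(truncated), chunk_len)]


# Toy example: a 10-token document, a 9-token total budget, 4-token chunks.
chunks = split_into_chunks(list(range(10)), chunk_len=4, max_total_len=9)
```

Raising either length lets the encoder see more of each discharge summary, at the cost of more memory and compute per document.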

Translated by Google Translate

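Clinical data usually cannot be freely distributed because of its highly confidential nature, which hinders the development of machine learning in healthcare. One way to mitigate this problem is to generate realistic synthetic datasets using generative adversarial networks (GANs). However, GANs are known to suffer from mode collapse, which produces low-diversity outputs. In this paper, we extend the classic GAN setup with an external memory that replays features of real samples. Using antiretroviral therapy for human immunodeficiency virus (ART for HIV) as a case study, we show that our extended setup improves convergence and, more importantly, effectively captures the severely class-imbalanced distributions that are common in real-world clinical data.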


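The two synthetic datasets comprise vital signs, laboratory test results, and administered fluid boluses and vasopressors for 3,910 patients with acute hypotension and for 2,164 patients with sepsis in the intensive care unit (ICU). The patient cohorts were built using previously published inclusion and exclusion criteria, and the data were created using generative adversarial networks (GANs) and the MIMIC-III clinical database. The risk of identity disclosure associated with releasing these data was estimated to be very low (0.045%). The datasets were released as part of the Health Gym, a project that aims to publicly distribute synthetic longitudinal health data for developing machine learning algorithms (with a particular focus on offline reinforcement learning) and for educational purposes.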


De-identifying Hospital Discharge Summaries: An End-to-End Framework using Ensemble of Deep Learning Models

Leibo Liu, Oscar Perez-Concha, Anthony Nguyen, Vicki Bennett, Louisa Jorm

Category: Natural Language Processing

2021-01-01

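Electronic medical records (EMRs) contain clinical narrative text that is of great potential value to medical researchers. However, this information is mixed with personally identifiable information (PII), which poses risks to patient and clinician confidentiality. This paper presents an end-to-end de-identification framework that automatically removes PII from hospital discharge summaries. Our corpus comprised 600 hospital discharge summaries extracted from the EMRs of two principal referral hospitals in Sydney, Australia. The framework consists of three components: 1) Annotation: labelling PII in the 600 hospital discharge summaries using five predefined categories: person, address, date of birth, identification number, and phone number; 2) Modelling: training six named entity recognition (NER) deep learning base models on balanced and imbalanced datasets, and evaluating ensembles that combine all six base models, the three base models with the best F1 scores, and the three base models with the best recall scores, using token-level majority voting and stacking methods; 3) De-identification: removing PII from the hospital discharge summaries. Our results show that the ensemble built with the stacking support vector machine (SVM) method on the three base models with the best F1 scores achieved excellent results, with an F1 score of 99.16% on the test set of our corpus. We also evaluated the robustness of the modelling component on the 2014 i2b2 de-identification dataset. Our ensemble model using token-level majority voting over all six base models achieved the highest F1 score of 96.24% under strict entity matching and the highest F1 score of 98.64% under binary token-level matching, compared with two state-of-the-art methods. The framework provides a robust solution for safely de-identifying clinical narrative text.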
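Token-level majority voting, as named in the modelling component, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the label strings, and the tie-breaking rule (the earliest model in the list wins a tie) are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token labels from several NER models by majority vote.

    predictions[m][t] is the label that model m assigns to token t.
    Ties go to the earliest model in the list (an assumption made here).
    """
    if not predictions:
        raise ValueError("at least one model prediction is required")
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    voted = []
    for token_labels in zip(*predictions):
        counts = Counter(token_labels)
        best = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        voted.append(next(label for label in token_labels if counts[label] == best))
    return voted


# Three hypothetical base models labelling the same three tokens.
model_a = ["PERSON", "O", "PHONE"]
model_b = ["PERSON", "O", "O"]
model_c = ["O", "ADDRESS", "PHONE"]
ensemble = majority_vote([model_a, model_b, model_c])
```

Stacking differs in that a second-level classifier, such as the SVM mentioned in the abstract, is trained on the base models' outputs instead of counting votes.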
